Dynamic enablement of Per-Partition Automatic Failover #46477
…afDynamicEnablement # Conflicts: # sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/PerPartitionAutomaticFailoverE2ETests.java # sdk/cosmos/azure-cosmos/CHANGELOG.md
/azp run java - cosmos - tests |
Azure Pipelines successfully started running 1 pipeline(s). |
Pull Request Overview
This PR implements dynamic enablement of Per-Partition Automatic Failover (PPAF) in the Azure Cosmos DB Java SDK. The key enhancement allows PPAF to be enabled/disabled at runtime based on service-side configuration, moving away from static client-side configuration.
Key changes include:
- Enhanced GlobalEndpointManager to monitor and respond to database account configuration changes
- Added dynamic PPAF configuration capability through a callback mechanism
- Extended user agent feature flags to include ThinClient and Http2 support
- Improved thread safety in UserAgentContainer with read-write locks
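The callback mechanism in the key changes above can be pictured with a small, self-contained sketch. It is not the SDK's code: PpafToggleSketch, AccountSnapshot, EndpointManagerSketch and onAccountRefreshed are hypothetical names, and the assumption that the flag starts out as false on the client is made only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Consumer;

public class PpafToggleSketch {

    /** Hypothetical stand-in for the account payload; only the flag matters here. */
    static final class AccountSnapshot {
        final boolean perPartitionFailoverEnabled;

        AccountSnapshot(boolean perPartitionFailoverEnabled) {
            this.perPartitionFailoverEnabled = perPartitionFailoverEnabled;
        }
    }

    /** Hypothetical stand-in for the component that refreshes account metadata. */
    static final class EndpointManagerSketch {
        // Assumption: the client starts with the feature off.
        private final AtomicBoolean lastSeen = new AtomicBoolean(false);
        private final Consumer<Boolean> onFlagChanged;

        EndpointManagerSketch(Consumer<Boolean> onFlagChanged) {
            this.onFlagChanged = onFlagChanged;
        }

        /** Invokes the callback only when the flag differs from the last observed value. */
        void onAccountRefreshed(AccountSnapshot snapshot) {
            boolean current = snapshot.perPartitionFailoverEnabled;
            boolean previous = lastSeen.getAndSet(current);
            if (previous != current) {
                onFlagChanged.accept(current);
            }
        }
    }

    public static void main(String[] args) {
        List<String> events = new ArrayList<>();
        EndpointManagerSketch manager =
            new EndpointManagerSketch(enabled -> events.add("flag changed to " + enabled));

        manager.onAccountRefreshed(new AccountSnapshot(false)); // same as initial state, no callback
        manager.onAccountRefreshed(new AccountSnapshot(true));  // false -> true, callback fires
        manager.onAccountRefreshed(new AccountSnapshot(true));  // unchanged, no callback
        manager.onAccountRefreshed(new AccountSnapshot(false)); // true -> false, callback fires

        events.forEach(System.out::println);
    }
}
```

In this sketch the owner of the callback decides what a change means (for example, resetting failover state), while the refresh path only detects the transition. That separation is one plausible reading of the change list, not a statement about how the PR is structured internally.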
Reviewed Changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 5 comments.
File | Description |
---|---|
GlobalPartitionEndpointManagerForPerPartitionCircuitBreaker.java | Added clear() method and synchronized resetCircuitBreakerConfig |
GlobalPartitionEndpointManagerForPerPartitionAutomaticFailover.java | Added synchronized resetPerPartitionAutomaticFailoverEnabled with state clearing |
UserAgentFeatureFlags.java | Added ThinClient and Http2 feature flags |
UserAgentContainer.java | Enhanced thread safety with read-write locks and improved flag handling |
RxGatewayStoreModel.java | Added gateway response recording for cancelled requests |
RxDocumentClientImpl.java | Refactored PPAF initialization to support dynamic enablement |
GlobalEndpointManager.java | Added dynamic PPAF monitoring and callback mechanism |
DiagnosticsClientContext.java | Updated diagnostics to track dynamic PPAF state |
ReflectionUtils.java | Added reflection utilities for testing GlobalEndpointManager owner |
PerPartitionAutomaticFailoverE2ETests.java | Added comprehensive end-to-end tests for dynamic PPAF enablement |
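The summary mentions read-write locks guarding user agent flag handling. As a rough illustration of that pattern only, the sketch below keeps a set of feature flags behind a ReentrantReadWriteLock. FeatureFlagHolderSketch, its Flag enum and the rendered suffix format are all invented for this example and do not reflect how UserAgentContainer actually encodes flags.

```java
import java.util.EnumSet;
import java.util.concurrent.locks.ReentrantReadWriteLock;
import java.util.stream.Collectors;

public class FeatureFlagHolderSketch {

    /** Hypothetical flags; the names only loosely mirror those mentioned in the review. */
    enum Flag { PER_PARTITION_AUTOMATIC_FAILOVER, THIN_CLIENT, HTTP2 }

    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private final EnumSet<Flag> flags = EnumSet.noneOf(Flag.class);
    private final String baseUserAgent;

    FeatureFlagHolderSketch(String baseUserAgent) {
        this.baseUserAgent = baseUserAgent;
    }

    /** Writers take the exclusive lock, so concurrent readers never observe a half-updated set. */
    void setFlag(Flag flag, boolean enabled) {
        lock.writeLock().lock();
        try {
            if (enabled) {
                flags.add(flag);
            } else {
                flags.remove(flag);
            }
        } finally {
            lock.writeLock().unlock();
        }
    }

    /** Readers share the read lock; the suffix format here is made up for the sketch. */
    String render() {
        lock.readLock().lock();
        try {
            if (flags.isEmpty()) {
                return baseUserAgent;
            }
            String suffix = flags.stream().map(Flag::name).collect(Collectors.joining(","));
            return baseUserAgent + "|" + suffix;
        } finally {
            lock.readLock().unlock();
        }
    }

    public static void main(String[] args) {
        FeatureFlagHolderSketch holder = new FeatureFlagHolderSketch("sketch-agent");
        holder.setFlag(Flag.HTTP2, true);
        holder.setFlag(Flag.THIN_CLIENT, true);
        System.out.println(holder.render());
        holder.setFlag(Flag.HTTP2, false);
        System.out.println(holder.render());
    }
}
```

A read-write lock suits this shape of access because the user agent string is read on many requests while flags change rarely, such as when a feature is toggled at runtime.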
/azp run java - cosmos - spark |
Azure Pipelines successfully started running 1 pipeline(s). |
/azp run java - cosmos - kafka |
Azure Pipelines successfully started running 1 pipeline(s). |
logger.warn("Availability strategy for reads, queries, read all and read many" +
    " is enabled when PerPartitionAutomaticFailover is enabled.");
logger.warn("As Per-Partition Automatic Failover (PPAF) is enabled a default End-to-End Operation Latency Policy will be applied for read, query, readAll and readyMany operation types.");
does this really have to be in warn-level (I know it was before - just asking)?
LGTM except for the log-level comments.
}

if (!Configs.isHttp2Enabled()) {
    userAgentFeatureFlags.remove(UserAgentFeatureFlags.Http2);
Should we only remove the Http2 flag if both conditions match:
- !Configs.isHttp2Enabled
- !Http2ConnectionConfig.isEnabled
@xinlian12 We have two independent paths to HTTP/2 enablement. I've made it more explicit - could you take another look?
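The exchange above comes down to a simple rule: with two independent enablement paths, the HTTP/2 flag should only be dropped when both are off. A minimal sketch of that rule follows. The class, method and parameter names are hypothetical, and the sketch makes no claim about how the merged code expresses it.

```java
public class Http2FlagSketch {

    /**
     * Returns true when the HTTP/2 user agent flag should be kept.
     * Assumption for this sketch: each path is reduced to a plain boolean.
     */
    static boolean shouldAdvertiseHttp2(boolean enabledViaSystemConfig,
                                        boolean enabledViaConnectionConfig) {
        // Keep the flag if either path enables HTTP/2; remove it only when both are disabled.
        return enabledViaSystemConfig || enabledViaConnectionConfig;
    }

    public static void main(String[] args) {
        boolean[] values = {false, true};
        for (boolean viaSystemConfig : values) {
            for (boolean viaConnectionConfig : values) {
                System.out.println(
                    "systemConfig=" + viaSystemConfig
                        + ", connectionConfig=" + viaConnectionConfig
                        + " -> advertise=" + shouldAdvertiseHttp2(viaSystemConfig, viaConnectionConfig));
            }
        }
    }
}
```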
LGTM, thanks
…afDynamicEnablement # Conflicts: # sdk/cosmos/azure-cosmos/CHANGELOG.md
Description
This pull request lets a Cosmos(Async)Client become per-partition automatic failover (PPAF) capable at runtime, once the service reports the account as PPAF capable, without requiring app-side client restarts.
Approach taken
A callback is passed from RxDocumentClientImpl, which is responsible for:
- GlobalPartitionEndpointManagerForPerPartitionCircuitBreaker
- GlobalPartitionEndpointManagerForPerPartitionAutomaticFailover
Invocation Context
This callback is invoked within GlobalEndpointManager, which:
- reads the DatabaseAccount payload
- checks the enablePerPartitionFailoverBehavior boolean flag
Behavior on Change
If a change in enablePerPartitionFailoverBehavior is detected:
- From true to false, both PPAF and PPCB (per-partition circuit breaker) are disabled, and any cached failover information is cleared. Cross region availability strategy is disabled.
Other changes
- QueryPlan calls.
Testing done
Relevant Integration Tests
- PerPartitionAutomaticFailoverE2ETests#testPpafWithWriteFailoverWithEligibleErrorStatusCodesWithPpafDynamicEnablement
- PerPartitionAutomaticFailoverE2ETests#testFailoverBehaviorForNonWriteOperationsWithPpafDynamicEnablement
Summary
Both tests validate the following:
- Initial State: Failover does not occur when enablePerPartitionFailoverBehavior is reported as false.
- Dynamic Enablement: Without restarting or reinitializing the CosmosAsyncClient instance, a DatabaseAccount refresh is triggered.
- Post-Refresh Behavior: After the refresh, enablePerPartitionFailoverBehavior is reported as true.
At this point:
Relevant end-to-end tests done
A workload (Query, Read and Create) was run against a non-PPAF enabled account. At roughly 18:21 EST, a complete quorum loss was triggered in North Central US. After this, PPAF was enabled on the account; at roughly 18:28, the CosmosClient instance received the DatabaseAccount payload with enablePerPartitionFailoverBehavior set to true.
Recovery logs through PPAF and PPCB
Failover transition example
All SDK Contribution checklist:
General Guidelines and Best Practices
Testing Guidelines